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Abstract. Gaussian mixture models are central to classical statistics, 
widely used in the information sciences, and have a rich mathematical 
structure. We examine their maximum likelihood estimates through the 
lens of algebraic statistics. The MLE is not an algebraic function of the 
data, so there is no notion of ML degree for these models. The criti¬ 
cal points of the likelihood function are transcendental, and there is no 
bound on their number, even for mixtures of two univariate Gaussians. 
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1 Introduction 

The primary purpose of this paper is to demonstrate the result stated in the title: 

Theorem 1. The maximum likelihood estimators of Gaussian mixture mod¬ 
els are transcendental functions. More precisely, there exist rational samples 
Xi,X 2 , ■ ■ ■ ,xn in Q" whose maximum likelihood parameters for the mixture of 
two n-dimensional Gaussians are not algebraic numbers over Q. 

The principle of maximum likelihood (ML) is central to statistical inference. 
Most implementations of ML estimation employ iterative hill-climbing methods, 
such as expectation maximization (EM). These can rarely certify that a globally 
optimal solution has been reached. An alternative paradigm, advanced by alge¬ 
braic statistics [8], is to find the ML estimator (MLE) by solving the likelihood 
equations. This is only feasible for small models, but it has the benefit of being 
exact and certifiable. An important notion in this approach is the ML degree, 
which is defined as the algebraic degree of the MLE as a function of the data. 
This rests on the premise that the likelihood equations are given by polynomials. 

Many models used in practice, such as exponential families for discrete or 
Gaussian observations, can be represented by polynomials. Hence, they have an 
ML degree that serves as an upper bound for the number of isolated local maxima 
of the likelihood function, independently of the sample size and the data. The ML 
degree is an intrinsic invariant of a statistical model, with interesting geometric 
and topological properties |l,Sj . The notion has proven useful for characterization 


of when MLEs admit a ‘closed form 'M- When the ML degree is moderate, these 
exact tools are guaranteed to find the optimal solution to the ML problem mM- 

However, the ML degree of a statistical model is only defined when the MLE is 
an algebraic function of the data. Theorem[2means that there is no ML degree for 
Gaussian mixtures. It also highlights a fundamental difference between likelihood 
inference and the method of moments mm- The latter is a computational 
paradigm within algebraic geometry, that is, it is based on solving polynomial 
equations. ML estimation being transcendental means that likelihood inference 
in Gaussian mixtures is outside the scope of algebraic geometry. 

The proof of Theorem [l] will appear in Sectionj^ In Sectionj^we shed further 
light on the transcendental nature of Gaussian mixture models. We focus on 
mixtures of two univariate Gaussians, the model given in Q below, and we 
present a family of data points on the real line such that the number of critical 
points of the corresponding log-likelihood function ([^ exceeds any bound. 

While the MLE for Gaussian mixtures is transcendental, this does not mean 
that exact methods are not available. Quite to the contrary. Work of Yap and his 
collaborators in computational geometry m convincingly demonstrates this. 
Using root bounds from transcendental number theory, they provide certified 
answers to geometric optimization problems whose solutions are known to be 
transcendental. Theorem opens up the possibility of transferring these tech¬ 
niques to statistical inference. In our view, Gaussian mixtures are an excellent 
domain of application for certified computation in numerical analytic geometry. 


2 Reaching Transcendence 


Transcendental number theory is a field that furnishes tools for deciding 

whether a given real number r is a root of a nonzero polynomial in Q[t]. If this 
holds then t is algebraic; otherwise r is transcendental. Eor instance, = 

4.059964873... is algebraic, and so are the parameter estimates computed by 
Pearson in his 1894 study of crab data m- By contrast, the famous constants 
TT = 3.141592653... and e = 2.718281828... are transcendental. Our proof will be 
based on the following classical result. A textbook reference is |21 Theorem 1.4]: 

Theorem 2 (Lindemann-Weierstrass). Ifui,...,Ur are distinct algebraic 
numbers then ,. • ■, e“'' are linearly independent over the algebraic numbers. 


Eor now, consider the case of n = 1, that is, mixtures of two univariate 
Gaussians. We allow mixtures with arbitrary means and variances. Our model 
then consists of all probability distributions on the real line K with density 
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It has five unknown parameters, namely, the means ^ 1,^2 G IR, the standard 
deviations cri,(T 2 > 0, and the mixture weight a G [0,1]. The aim is to estimate 
the five model parameters from a collection of data points xi, 0 : 2 ,..., a;iv G M. 












The log-likelihood function of the model Q is 
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This is a function of the five parameters, while xi,... ,xn are fixed constants. 

The principle of maximum likelihood suggests to find estimates by maximiz¬ 
ing the function i over the hve-dimensional parameter space 0 = [0,1] x K">o- 

Remark 1. The log-likelihood function ^ in (H is never bounded above. To see 
this, we argue as in [H Section 9.2.1]. Set TV = 2, fix arbitrary values ao S [0,1], 
/i 2 o G K and (T 20 > 0, and match the first mean to the first data point = xi. 
The remaining function of one unknown cti equals 
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The lower bound tends to 00 as cti —)■ 0. 


Remark [T] means that there is no global solution to the MLE problem. This 
is remedied by restricting to a subset of the parameter space O. In practice, 
maximum likelihood for Gaussian mixtures means computing local maxima of 
the function These are found numerically by a hill climbing method, such as 
the EM algorithm, with particular choices of starting values. See Section]^ This 
method is implemented, for instance, in the R package MCLUST [9]. In order 
for Theoremto cover such local maxima, we prove the following statement: 

There exist samples Xi,...,Xn G Q such that every non-trivial critical 
point (d,/ti,/t 2 , di, ( 72 ) of the log-likelihood function £ in the domain 0 
has at least one transcendental coordinate. 


Here, a critical point is non-trivial if it yields an honest mixture, i.e. a distribu¬ 
tion that is not Gaussian. By the identifiability results of [20] , this happens if and 
only if the estimate (d, /ti, /t 2 , di, d 2 ) satisfies 0 < d < 1 and (/ti, di) 7 ^ (/i 2 , d 2 ). 

Remark 2. The log-likelihood function always has some algebraic critical points, 
for any xi,..., xyi G Q. Indeed, if we define the empirical mean and variance as 

N TV 
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then any point (d, pi, jl 2 , di, d 2 ) with pi = jl 2 = x and di = d 2 = s is critical. 
This gives a Gaussian distribution with mean x and variance s^, so it is trivial. 


Proof (of Theorem^. First, we treat the univariate case. Gonsider the partial 
derivative of ([^ with respect to the mixture weight a: 
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Clearing the common denominator 
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Letting ai = a and 02 = 1 — <a, we may rewrite the left-hand side of Q as 
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We expand the products, collect terms, and set Ni(k) = |{j : kj = i}|. With 
this, the partial derivative dijda is zero if and only if the following vanishes: 
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Let (d, fii,ft 2 , di, (J 2 ) be a non-trivial isolated critical point of the likelihood 
function. This means that 0 < (d < 1 and (/ti, di) 7 ^ (/t 2 , d 2 ). This point depends 
continuously on the choice of the data Xi,X 2 , ■ ■ ■ ,xn. By moving the vector with 
these coordinates along a general line in the mixture parameter a moves 
continuously in the critical equation d£/da = 0 above. By the Implicit Function 
Theorem, it takes on all values in some open interval of K, and we can thus 
choose our data points Xi general enough so that a is not an integer multiple of 
1/iV. We can further ensure that the last sum above is a Q(Q;)-linear combination 
of exponentials with nonzero coefficients. 

Suppose that (d, /li, / 12 , di, (T 2 ) is algebraic. The Lindemann-Weierstrass The¬ 
orem implies that the arguments of exp are all the same. Then the 2^ numbers 
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are all identical. However, for N > 3, and for general choice of data Xi,..., xn 
as above, this can only happen if (/ti,(Ti) = (/t 2 ,d 2 ). This contradicts our hy¬ 
pothesis that the critical point is non-trivial. We conclude that all non-trivial 
critical points of the log-likelihood function ([^ are transcendental. 

In the multivariate case, the model parameters comprise the mixture weight 
a G [0,1], mean vectors ^ 1,^2 G R” and positive definite covariance matrices 
G Arguing as above, if a non-trivial critical (d, /Ji, /J 2 , ^ 1 , ^ 2 ) is 
algebraic, then the Lindemann-Weierstrass Theorem implies that the numbers 

N 

-fik,), kG { 1 , 2 }^ , 

J=i 

are all identical. For N sufficiently large and a general choice of cci,..., in R", 
the 2^ numbers are identical only if (/ii, Ai) = (/I 2 , A 2 ). Again, this constitutes 
a contradiction to the hypothesis that (d, /ti, /t 2 , Ai, U 2 ) is non-trivial. □ 

Many variations and specializations of the Gaussian mixture model are used 
in applications. In the case n = 1, the variances are sometimes assumed equal, 
so (Ti = (72 for the above two-mixture. This avoids the issue of an unbounded 
likelihood function (as long as TV > 3). Our proof of Theorem applies to this 
setting. In higher dimensions (n > 2), the covariance matrices are sometimes as¬ 
sumed arbitrary and distinct, sometimes arbitrary and equal, but often also have 
special structure such as being diagonal. Various default choices are discussed in 
the paper [5] that introduces the R package MCLUST. Our results imply that 
maximum likelihood estimation is transcendental for all these MCLUST models. 

Example 1. We illustrate Theoremj^for a specialization of Q obtained by fixing 
three parameters: /i 2 = 0 and a\ = (J 2 = \l\f2. The remaining two free param¬ 
eters are a and pi = pLi. We take only N = 2 data points, namely xi = 0 and 
X 2 = X > 0. Omitting an additive constant, our log-likelihood function equals 

£{a, pt) = log(Q; • e~^ + (1 — a)) -|- log(a • -I- (1 — a) • e~^ ) . (6) 

For a concrete example take x = 2. The graph of Q for this choice is shown 
in FigureBy maximizing £{a, pi) numerically, we find the parameter estimates 

d = 0.50173262959803874... and /i = 1.95742494230308167... (7) 

Our technique can be applied to prove that d and fl are transcendental over Q. 
We illustrate it for /}. 

For any x G M, the function £{a,pi) is bounded from above and achieves its 
maximum on [0,1] x K. If x > 0 is large, then any global maximum (d, p) of £ is 
in the interior of [0,1] x R and satisfies 0 < /i < x. According to a Mathematica 
computation, the choice x > 1.56125... suffices for this. Assume that this holds. 
Setting the two partial derivatives equal to zero and eliminating the unknown a 
in a further Mathematica computation, the critical equation for p, is found to be 

(x - - X + ^ 0 . (8) 



Fig. 1. Graph of the log-likelihood function for two data points xi = 0 and X 2 = 2. 


Suppose for contradiction that both x and jl are algebraic numbers over Q. 
Since 0 < fl < x, we have —fl{2x — fl) < 0 < Hence ui = p?, U 2 = 0 and 
U 3 = —p(2x — p) are distinct algebraic numbers. The Lindemann-Weierstrass 
Theorem implies that , e“^ and e“3 are linearly independent over the field of 
algebraic numbers. However, from Q we know that 

(a: — /I) • — X • e“^ -|- /t • e“^ = 0 . 

This is a contradiction. We conclude that the number p is transcendental over Q. 


3 Many Critical Points 

Theorem [T] shows that Gaussian mixtures do not admit an ML degree. This 
raises the question of how to find any bound for the number of critical points. 

Problem 1. Does there exist a universal bound on the number of non-trivial 
critical points for the log-likelihood function of the mixture of two univariate 
Gaussians? Or, can we find a sequence of samples on the real line such that the 
number of non-trivial critical points increases beyond any bound? 

We shall resolve this problem by answering the second question affirmatively. 
The idea behind our solution is to choose a sample consisting of many well- 
separated clusters of size 2. Then each cluster gives rise to a distinct non-trivial 
critical point (d, pi, p 2 , ^ 1 ,^ 2 ) of the log-likelihood function £ from ([^. We pro¬ 
pose one particular choice of data, but many others would work too. 

Theorem 3. Fix sample size N = 2K for K > 2, and take the ordered sample 
(xi,... , X2k) = (1,1-2, 2, 2.2, ... ,K, K+0.2). Then, for eaeh k G {1, • ■ ■ ,K}, 
the log-likelihood funetion £ from ([^ has a non-trivial eritical point with k < 
pi < k -\- 0.2. Henee, there are at least K non-trivial critical points. 



Table 1. Seven critical points of the log-likelihood function in Theoremj^with K = 7. 
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log-likelihood 

1 

0.1311958 

1.098998 

4.553174 

0.09999497 

1.746049 

-27.2918782147578 

2 

0.1032031 

2.097836 

4.330408 

0.09997658 

1.988948 

-28.6397463805501 

3 

0.07883084 

3.097929 

4.185754 

0.09997856 

2.06374 

-29.1550277534757 

4 

0.06897294 

4.1 

4.1 

0.1 

2.07517 

-29.2858981551065 

5 

0.07883084 

5.102071 

4.014246 

0.09997856 

2.06374 

-29.1550277534757 

6 

0.1032031 

6.102164 

3.869592 

0.09997658 

1.988948 

-28.6397463805501 

7 

0.1311958 

7.101002 

3.646826 

0.09999497 

1.746049 

-27.2918782147578 


Before turning to the proof, we offer a numerical illustration. 


Example 2. For K = 7, we have iV = 14 data points in the interval [1,7.2]. 
Running the EM algorithm (as explained in the proof of Theorem below) 
yields the non-trivial critical points reported in Table Their pi coordinates 
are seen to be close to the cluster midpoints k + 0.1 for all k. The observed 
symmetry under reversing the order of the rows also holds for all larger K. 

Our proof of Theorem will be based on the EM algorithm. We first recall 
this algorithm. Let be the mixture density from Q, and let 

be the two Gaussian component densities. Define 
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which can be interpreted as the conditional probability that data point Xi belongs 
to the first mixture component. Further, define Ni = X]i=i 7* -^2 = N — 

Ni, which are expected cluster sizes. Following Section 9.2.2], the likelihood 
equations for our model can be written in the following fixed-point form: 
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In the present context, the EM algorithm amounts to solving these equations 
iteratively. More precisely, consider any starting point {a, pi, p 2 ,o'i,o' 2 ). Then 
the E-step (“expectation”) computes the estimated frequencies 7 ^ via In 



















the subsequent M-step (“maximization”), one obtains a new parameter vector 
(a,/ii,/i2, CTi, CT2) by evaluating the right-hand sides of the equations (10)-(12). 
The two steps are repeated until a fixed point is reached, up to the desired nu¬ 
merical accuracy. The updates never decrease the log-likelihood. For our problem 
it can be shown that the algorithm will converge to a critical point; see e.g. m- 


Proof (of Theorem^. FixfcG{l,...,if}. We choose starting parameter values 
to suggest that the pair {x 2 k-i,X 2 k) = {k,k + 0.2) belongs to the first mixture 
component, while the rest of the sample belongs to the second. Explicitly, we set 
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2 
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K'^ + 1.2K -2k-0.2 
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We shall argue that, when running the EM algorithm, the parameters will always 
stay close to these starting values. Specifically, we claim that throughout all EM 
iterations, the parameter values satisfy the inequalities 


1 1 

- < a < — , 
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0.09 < Ail - fc < 0.11, 0.099 < (Tl < 0.105, 
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The starting values proposed above obviously satisfy the inequalities in (13), and 
it is not difficult to check that (141 a nd ([T^ are satisfied as well. To prove the 
theorem, it remains to show that (13)-(15) continue to hold after an EM update. 

In the remainder, we assume that K > 22. For smaller values of K the claim 
of the theorem can be checked by running the EM algorithm. In particular, for 
K > S, the second standard deviation satisfies the simpler bounds 
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A key property is that the quantity 7^, computed in the E-step, is always 
very close to zero for i ^ 2k — 1, 2k. To see why, rewrite © as 
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Since a < l/K, we have VV > AT — 1. On the other hand, ^ . 

Using that K > 22, their product is thus bounded below by 0.3209. Turning to 
































the exponential term, the second inequality in ( 13 ) implies that \xi —/ri| > 0.89 


for i = 2fc — 2 or z = 2/c + 1, which index the data points closest to the kth pair. 


Using ( 16 ), we obtain 
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> 67.86 


From > 5.4 • 10 ^^, we deduce that 7^ < 10 “^"^. The exponential term 

becomes only smaller as the considered data point Xi move away from the /cth 
pair. As \i — { 2 k — 1/2) | increases, 7^ decreases and can be bounded above by 
a geometric progression starting at 10 “^"^ and with ratio 10 “®^. This makes 7^ 
with i 7^ 2 fc, 2 fc — 1 negligible. Indeed, from the limit of geometric series, we have 


Si = ^ < W" , ( 17 ) 

i^2k-l,2k 


and similarly, S2 = Z(i/2fe-i.2/c satisfies 

|s2| = |72 /c-2( — 0 . 8 ) + 72fc+l(l) + 72fe-3( — 1 ) + 72fe+2(l-2) + .. .| < 10 “^^ . 

( 18 ) 

The two sums si and S2 are relevant for the M-step. 

The probabilities 72^-1 and ^2k give the main contribution to the averages 
that are evaluated in the M-step. They satisfy 0.2621 < 72^-1,72/0 < 0 . 9219 . 
Moreover, we may show that the values of 72/0-1 and 72/c are similar, namely: 


0.8298 < < 1.2213 , 
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which we prove by writing 
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Bringing it all together, we have 
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inequality in (13) holds. 


Using the upper bound in (19), we also have 0.09 < — k. Hence, the second 


The inequalities for the other parameters are verified similarly. For instance. 
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holds for Of = Therefore, the first inequality in (13) continues to be true. 


We conclude that running the EM algorithm from the chosen starting values 


yields a sequence of parameter vectors that satisfy the inequalities (13)-(16). The 


sequence has at least one limit point, which must be a non-trivial critical point of 
the log-likelihood function. Therefore, for every k = 1,..., K, the log-likelihood 
function has a non-trivial critical point with fj,i € (k, k + 0.2). □ 


4 Conclusion 

We showed that the maximum likelihood estimator (MLE) in Gaussian mixture 
models is not an algebraic function of the data, and that the log-likelihood 
function may have arbitrarily many critical points. Hence, in contrast to the 
models studied so far in algebraic statistics 1518112119) . there is no notion of an 
ML degree for Gaussian mixtures. However, certified likelihood inference may 
still be possible, via transcendental root separation bounds, as in m- 

Remark 3. The Cauchy-location model, treated in m is an example where the 
ML estimation is algebraic but the ML degree, and also the maximum number 
of local maxima, depends on the sample size and increases beyond any bound. 

Remark 4- The ML estimation problem admits a population/infinite-sample ver¬ 
sion. Here the maximization of the likelihood function is replaced by minimiza¬ 
tion of the Kullbaek-Leibler divergence between a given data-generating distribu¬ 
tion and the distributions in the model. The question of whether this population 
problem is subject to local but not global maxima was raised in m —in the 
context of Gaussian mixtures with known and equal variances. It is known that 
the Kullbaek-Leibler divergence for such Gaussian mixtures is not an analytic 
function §7-8]. Readers of Japanese should be able to find details in [55]. 

As previously mentioned, Theorem shows that likelihood inference is in 
a fundamental way more complicated than the classical method of moments 
US]. The latter involves only the solution of polynomial equation systems. This 
was recognized also in the computer science literature on learning Gaussian 
mixtures [3 1T0IT4] , where most of the recent progress is based on variants of the 
method of moments rather than likelihood inference. We refer to [T] for a study 
of the method of moments from an algebraic perspective. Section 3 in that paper 
illustrates the behavior of Pearson’s method for the sample used in Theorem [^ 
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